Skip to content

[Bug][TMA] Drop buffer_ob gate from 1D bulk-copy eligibility#2235

Closed
mygitljf wants to merge 1 commit into
tile-ai:mainfrom
mygitljf:fix/issue-2180-tma-1d-no-split
Closed

[Bug][TMA] Drop buffer_ob gate from 1D bulk-copy eligibility#2235
mygitljf wants to merge 1 commit into
tile-ai:mainfrom
mygitljf:fix/issue-2180-tma-1d-no-split

Conversation

@mygitljf
Copy link
Copy Markdown

@mygitljf mygitljf commented May 20, 2026

Fixes #2180.

When i was looking at the issue, the 4-way split came from the 1D bulk-copy eligibility being gated on the same flag the descriptor-based 2D path uses, even though the 1D path doesn't actually need that gate.
I narrowed it to a single safety check that only matters for the descriptor variant and then added 3 regression tests covering fp32/K=1024, fp32/K=512 and fp16/K=1024 with a T.dynamic outer shape.

Verification
Pre-patch generated source for the issue's repro (extracted from the codegen string):

extern "C" __global__ void __launch_bounds__(256, 1)
gemm_kernel(__grid_constant__ const CUtensorMap A_desc, int M) {
  ...
  if (tl::tl_shuffle_elect<256>()) {
    mbar[0].expect_transaction(4096);
    tl::tma_load(A_desc, mbar[0], (&(a_shared[  0])),   0, var);
    tl::tma_load(A_desc, mbar[0], (&(a_shared[256])), 256, var);
    tl::tma_load(A_desc, mbar[0], (&(a_shared[512])), 512, var);
    tl::tma_load(A_desc, mbar[0], (&(a_shared[768])), 768, var);
  }
}

Post-patch:

extern "C" __global__ void __launch_bounds__(256, 1)
gemm_kernel(float* __restrict__ A, int M) {
  ...
  mbar[0].expect_transaction(4096);
  tl::tma_load((&(a_shared[0])), (&(A[(((int64_t)var) * (int64_t)1024)])), mbar[0], 4096);
}

A single 1D cp.async.bulk issue, no CUtensorMap descriptor — matches the expected behavior in the issue.

When the global tensor has a dynamic outer shape and the index cannot
be statically proven within bounds, `buffer_oob` is set to true. The
existing logic in `AnalyzeCopyFacts` then masked out 1D bulk-copy
eligibility along with the descriptor-based 2D path, causing
`InferLayout` to install a swizzle-shaped shared layout via
`InferBulkLayout` and forcing `Copy::Lower` to fall through to the
2D `LowerBulk`. There the `inner_box_dim > 256` branch issues
`(K / 256)` separate `tl::tma_load` calls instead of a single 1D
bulk copy.

The 1D bulk-copy path emits `cp.async.bulk` (not `cp.async.bulk.tensor`)
and has the same OOB semantics as a plain `T.copy()` - it does not
need the descriptor-only OOB gate. Drop `!ctx.buffer_oob` from the
1D eligibility check so dynamic-outer-shape 1D copies that already
satisfy `CheckBulkCopy1D` (contiguous innermost slice, full-extent
trailing dim, element count match) keep the single-issue path.

Repro: see issue tile-ai#2180 - `T.tma_copy(A[var, 0:K], a_shared, barrier=mbar)`
with `M=T.dynamic('M')`, `K=T.const('K')` on `A[M, K]` was lowering to
4 split `tl::tma_load` calls; with this patch it lowers to a single
`tl::tma_load(smem, gmem, mbar, total_bytes)` and no `CUtensorMap`
descriptor. Verified locally on A100 at the codegen-string level
(end-to-end launch requires sm_90+ and is left to CI).

Tests: 3 new regression cases in
`testing/python/language/test_tilelang_language_tma_1d.py` covering
fp32/K=1024, fp32/K=512, fp16/K=1024 with `T.dynamic` outer shape.
All existing TMA tests pass without regression.
@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a334b809-3667-4295-85bf-14b36bf4949b

📥 Commits

Reviewing files that changed from the base of the PR and between 5a22d62 and 73589fc.

📒 Files selected for processing (2)
  • src/backend/cuda/op/copy_analysis.cc
  • testing/python/language/test_tilelang_language_tma_1d.py

📝 Walkthrough

Walkthrough

Fixed TMA availability gating in copy_analysis.cc to enable the 1D bulk-copy code path by removing the ctx.buffer_oob requirement, preventing unnecessary descriptor-based TMA splitting. Added four regression tests validating single tl::tma_load emission and absence of CUtensorMap for Issue #2180.

Changes

Issue #2180 TMA 1D load path fix

Layer / File(s) Summary
TMA availability gating fix in copy_analysis
src/backend/cuda/op/copy_analysis.cc
Changed layout_dependent_tma_available logic to enable 1D bulk-copy path when layout map exists and target is not CuTeDSL, removing prior !ctx.buffer_oob gating. Added comments explaining OOB handling for descriptor-based 2D TMA versus 1D bulk-copy paths.
Regression test helpers and test suite
testing/python/language/test_tilelang_language_tma_1d.py
Added _lower_issue_2180_kernel to JIT/lower GEMM kernels with dynamic M and tma_copy targeting sm_90a, and _check_single_1d_tma to validate generated code contains exactly one tl::tma_load( and omits CUtensorMap. Implemented four test functions for K={1024,512} across fp32 and fp16, and updated main() to invoke these tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • tile-ai/tilelang#2137: Fixes 1D TMA store layout inference in copy.cc and adds matching 1D tma_store regression tests, addressing the same descriptor-vs-bulk-copy lowering issue from a different angle.
  • tile-ai/tilelang#1989: Targets 1D TMA codegen correctness and adds regression tests asserting compiled kernels emit 1D TMA (cp.async.bulk.tensor) and run correctly.
  • tile-ai/tilelang#766: Demonstrates 1D TMA/bulk-copy path usage in an example with the Scale tensor, providing a practical reference for the same TMA code path.

Poem

🐰 A TMA that split into four became one,
No more CUtensorMap, the 1D path is won!
With layout maps checked, the gating grows wise,
Tests guard the fix from future demise. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: fixing the OOB gate condition for 1D TMA bulk-copy eligibility, directly addressing issue #2180.
Linked Issues check ✅ Passed The PR directly addresses issue #2180 by removing the !ctx.buffer_oob gate from layout-dependent TMA logic and adds regression test coverage for the fix.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the TMA 1D bulk-copy issue: core fix in copy_analysis.cc and regression tests in test_tilelang_language_tma_1d.py.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@LeiWang1999 LeiWang1999 self-requested a review May 21, 2026 01:41
@mygitljf mygitljf marked this pull request as draft May 21, 2026 16:41
@mygitljf mygitljf closed this May 22, 2026
@mygitljf mygitljf changed the title [Bug][TMA] Skip OOB gate for 1D TMA bulk-copy eligibility (#2180) [Bug][TMA] Drop buffer_ob gate from 1D bulk-copy eligibility May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] TMA load unnecessary spliting

1 participant